Windows Azure : Exploring Full-Text Search (part 1) - Indexing

10/22/2010 6:07:05 PM

1. Understanding Full-Text Search

Imagine that you’re implementing a search engine for your website. You could write some simple database queries to search your content. You might use the LIKE operator in SQL to do simple pattern matching against the search queries. However, you’ll soon find out that this breaks down even in simple searches.

The terms users search for may not appear together. Users may search for variants of the term taxes where your database has only tax, or players where your database has only player. Since everyone is familiar with Google, your users will expect these search queries to work and to “do the right thing.” This is way beyond what you can do with a few simple SQL queries.

Performance would be terrible, since the database would need to look through every single row to find the data you want. Unlike a numeric column, you could not create an index on which the execution engine could perform a binary search to find the right row to go to.

Though you cannot build Google, you can provide rudimentary full-text search capability. This is available out of the box with most modern RDBMSs—SQL Server has had it for a couple of versions, and MySQL added it in version 3 and enhanced it considerably in later versions. This feature examines all the words in every stored document (where “document” refers to any content you stick in the database), and attempts to match that with the user’s queries.

Full-text search (FTS) engines are smart enough to recognize different versions of the same word. For example, searching for any of the words driven, drove, or drives should bring up drive as a result. FTS engines also know to detect similar phrases and perform rudimentary Boolean logic where you can search for the existence of multiple terms. They also typically contain rudimentary ranking algorithms to rank the results they find.

Another popular option is to use open source FTS projects such as Lucene that use the filesystem as a store. However, these don’t typically work on Windows Azure, or don’t fit into the stateless frontend model that the cloud demands, since they use the filesystem as a backend store.

Note: Although there are ways to change the backend store of these projects and make them store data elsewhere, they’re sparsely used at this point, and in general, they don’t work as well as using the filesystem. It would be quite difficult to optimize for Azure’s Table service, as we will explore throughout this discussion.

2. Indexing

Now let’s get to the fun part: what makes these FTS engines tick. No discussion on FTS engines can get far without a discussion on indexing—or, to be specific, a discussion on indexes.

You’re probably familiar with the role of an index in a book. If you turn to the back of this book, you’ll find one. If you look up the term full-text search, it should refer you back to this chapter. Now, try to imagine a book without an index. Finding a specific term in such a book would be a painful task. You would need to turn page by page, skim through the text, and try to guess in which chapter or section it appears. You may succeed quickly in a slim book, but imagine a book with several hundred pages, or even thousands of pages.

Searching through a database poses the same challenge, but on a much larger scale. Instead of having to read through thousands of pages, the computer must look through the equivalent of millions of pages (if not more). Even though databases are fast at looking through things, this is still too slow a process, and entails too much data to walk through sequentially.

The fix is simple. Create an index of the data in the database, just like the index at the back of this book, and have the computer look up the term in the index to find out where the data resides. This raises a question: what about the time it takes to look up the term in an index?

Think for a second about how you look up terms in a book’s index. Since the index is alphabetically sorted, you know whether to go backward or forward, and how many pages to skip. Similarly, the index in FTS engines is stored in a sorted order. The FTS engine can perform a binary search through the index to quickly arrive at the right term.

At this point, if you were using a typical RDBMS, you probably know as much about FTS indexes as you need to. In theory, you could just ask the database to start indexing specific columns in your tables. However, as mentioned earlier, Azure storage does not come with this out of the box. That means it is up to developers to do the heavy lifting to build these indexes and write the code to look through them.

You will have help when doing this. You can use Azure storage to store the actual data, and use Azure tables to query the index. This enables you to index and query over really large datasets without having to worry about memory or storage capacity issues.

Note: Will Azure storage have FTS sometime in the future? At the moment, there have been no announcements from Microsoft putting FTS on the official road map. However, the Azure team knows this is something a lot of customers want, and is actively investigating adding this in a future release. If you’re reading this book a few years after publication, chances are this feature may be part of the feature set, and you can skip over this entire section.

2.1. Documents and terms

Let’s become familiar with two terms that are common in the FTS world:

Document: A document is the unit of content that is returned in your search results. For web search engines such as Google and Bing, a document is a web page. If you’re building an email search, a document is an email. You can think of it as the thing that shows up in the query results. Documents can be of any granularity. If you were indexing books, you could make an entire book one big document. Or you could drill down and make each chapter, or each section, a document to enable you to get as close as possible to the exact location of the search term. The index at the back of this book uses individual pages as its “documents,” since the “result” is a page number you can turn to.
Term: A document is made up of several terms. Typically, for textual content, every term can be roughly approximated to a word—the previous sentence can be said to have eight terms. Note the word approximated, because you might be transforming the term before indexing to help with your search results.

Now let’s take a look at transforming.

2.2. Case folding and stemming

Following is a famous rhyme from the book The Lord of the Rings by J.R.R. Tolkien (George Allen and Unwin). This will be used for the next few examples.

Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in their halls of stone,
Nine for Mortal Men doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them
In the Land of Mordor where the Shadows lie.

Let’s assume for a moment that each word in this poem is a term, and each line is a document in itself. If a user searches for “one” (note the lack of capitalization), you would want to show all the lines starting with the text “One Ring.” To put it differently, users shouldn’t need to care about the case of the queries they’re typing in.

To support this, you must transform your documents by case-folding into a standard case, either all uppercase or all lowercase. This makes life much easier when searching for this text, since you don’t have to worry about the case of the actual content. For example, you would convert the first line in the poem into “three rings for the elven-kings under the sky.”

Variants on terms are a trickier issue. Look at the third line in the poem. If someone searched for the term doom, you should be able to recognize that doomed is a variant of doom. This is almost impossible to do at query time.

The right thing to do is to convert every word into its root form through a process called stemming. During indexing, you convert every word into a stemmed version. For example, instead of storing doomed, you would store the word doom. Plurals, tenses, and other word variants will be converted into one standard form. At query time, you convert the terms in the query into their stemmed version as well.

There are several well-known stemming algorithms with public, open source libraries. Throughout this chapter, you will use a simple, well-known implementation called the Porter Stemming Algorithm. You can find implementations of Porter’s algorithm for various languages (as well as information on the algorithm itself) at http://tartarus.org/~martin/PorterStemmer/.

Note: Big search engines such as Google and Yahoo! use complex stemming and case-folding algorithms that go beyond the simple process just described. For example, the preceding algorithms can’t deal with international input—most open source stemming libraries work only with English text. Even something as simple as case folding becomes tricky with some languages in which “cases” don’t work in the same way they do in the English language. If you’re interested in exploring these topics in detail (and exploring searching in general) you should pick up any text on information retrieval and start digging. It is a vast and rich field.

2.3. Inverted indexes

The correct term for the kind of index you’ll be building here is an inverted index (or reverse index). This is not a novel computer science data structure. Religious scholars have used them for centuries, and important works have had concordances created for them manually. (A concordance is an alphabetical listing of key terms in a work, with the identification or citation of the passages in which they occur.)

Note: One particular famous (but unintended) use of concordances involved the Dead Sea Scrolls, the only known surviving copies of Biblical documents made before 100 A.D. After the discovery of the Dead Sea Scrolls in the 1940s, the team that discovered and excavated the caves refused to publish most of the scrolls, but published a concordance that contained a mapping of terms to the scroll in which they appear. In 1991, a Cincinnati student entered the concordance into his computer and reconstructed the original underlying text. This caused the immediate release of the facsimile edition of the originals.

The best way to describe these indexes is with an example. They are typically made up of two data structures. One data structure contains a mapping of document IDs to the document’s contents. Using the earlier poem from The Lord of the Rings, this mapping would look like Table 1 if you assume every line to be a document by itself.

Table 1. Document ID to document mapping
Document ID	Document
0	Three Rings for the Elven-kings under the sky,
1	Seven for the Dwarf-lords in their halls of stone,
2	Nine for Mortal Men doomed to die,
3	One for the Dark Lord on his dark throne
4	In the Land of Mordor where the Shadows lie.
5	One Ring to rule them all, One Ring to find them,
6	One Ring to bring them all and in the darkness bind them
7	In the Land of Mordor where the Shadows lie.

An inverted index contains a list of pointers of terms to the documents in which they appear. The list of terms is exhaustive—if you add up all the terms in the inverted index, you will have a lexicon of all terms that appear in all the documents that were indexed. Table 2 shows a snippet of an inverted index constructed from Table 1 . (The full inverted index contains more than 45 terms.)

Table 2. Inverted index
Term	Document IDs
Three	0
Rings	0
for	0, 1, 2, 3
the	0, 0, 1, 3, 4, 4, 6, 7, 7
Elven-kings	0
under	0
sky,	0
Seven	1
Dwarf-lords	1
Mordor	4, 7
Ring	5, 5, 6

Looking at Table 2 , a few things are apparent. First, you can see that a few common words dominate in occurrence (in this case, the word the). These are called stop words, and are usually eliminated from queries, since they occur so frequently and tend to pull in more search results than necessary.

You also see that two terms are very similar to each other: Ring and Rings. These appear since there was no stemming in this table. In the code samples provided in this chapter, you’ll be stemming all terms to collapse every word into its root form.

At this point, you might have figured out how searching works. A query for a single term means a single look-up of the inverted index table, and a retrieval of the list of documents. A query that involves multiple terms (in which the user is looking for documents where all the terms appear) is performed by doing an intersection of the result sets for each term.

For example, consider the sample search “Rings the Elven-kings.” Using Table 11-2 , you know that the term Rings appears in document 0, Elven-kings appears in document 0, and the appears in quite a few documents, including document 0. The intersection of all of these is just one document (line)—0—the very first line of the poem.

Let’s work through another sample query: “Ring Mordor.” In this case, Ring maps to documents 5, 5, and 6, while Mordor maps to 4 and 7. Since there is no overlap, you have no results to display for this query.

This examination actually sidesteps a big part of showing search results: ranking. A lot of the “secret sauce” in large search engines such as Google is in ranking results. However, most users of FTS use it to display search results from a single website, or an otherwise limited source of data. Ranking results is difficult in such cases and algorithms to do so vary wildly—from examining where the search term figures in the text, to looking at page traffic for all the result pages. This also reflects the fact that ranking in FTS engines today is a bit of an unsolved problem.

Thus, this examination does not discuss ranking results. Once you have the list of results, let your imagination run wild on how to rank them.